Abstract This paper proposes a technique for the unsupervised
detection and tracking of arbitrary objects
in videos. It is intended to reduce the need for detec- 
tion and localization methods tailored to specific ob- 
ject types and serve as a general framework applicable 
to videos with varied objects, backgrounds, and image 
qualities. The technique uses a dependent Dirichlet pro- 
cess mixture (DDPM) known as the Generalized Polya 
Urn (GPUDDPM) to model image pixel data that can 
be easily and efficiently extracted from the regions in 
a video that represent objects. This paper describes 
a specific implementation of the model using spatial 
and color pixel data extracted via frame differencing 
and gives two algorithms for performing inference in 
the model to accomplish detection and tracking. This 
technique is demonstrated on multiple synthetic and 
benchmark video datasets that illustrate its ability to, 
without modification, detect and track objects with di- 
verse physical characteristics moving over non-uniform 
backgrounds and through occlusion. 



1 Introduction 

We define unsupervised detection and tracking of arbi- 
trary objects in videos to be the task of automatically 
identifying the distinct objects present in a sequence 
of images and determining the path each object follows 
over time. Techniques that accomplish this task are use- 
ful in many fields that make use of video data, including 
robotics, video surveillance, time-lapse microscopy, and 
video summarization. By studying this task, we hope 
to help make progress towards general machine vision
algorithms that can learn the positions, appearances, 
and number of objects present in any video scene. 

This task can be broken down into three parts: data 
extraction, localization, and tracking. Data extraction 
is the act of extracting features from regions of video- 
frames that constitute objects, localization is the act of 
finding the positions and/or shapes of distinct objects, 
and tracking is the act of maintaining the identities of 
the detected objects over time. This paper introduces 
a new framework for carrying out these three actions 
based on a type of dependent Dirichlet process mixture 
model. This framework provides a foundation for a class 
of unsupervised algorithms that can detect and track 
arbitrary objects in a wide range of videos. 

Research related to general detection and tracking 
of objects tends to focus on one of either extraction, 
localization, or tracking. Integrating all three tasks in a 
system for multiple arbitrary objects and diverse video 
types is not often a primary focus. A few attempts at 
accomplishing the three tasks in a cohesive manner have 
been studied in recent years [8, 9, 25, 49]. This paper 
furthers this line of work by providing a model that 
could give rise to a number of algorithms to detect ar- 
bitrary objects in videos — particularly in cases where 
frame-by-frame segmentation is difficult, video quality 
is low, and extraction is noisy — and maintain the iso- 
lation of distinct objects during tracking and through 
occlusion. 

We begin by describing characteristics of the ex- 
tracted data (Section 3), and giving the generalized 
form of the model (Section 4). To implement this model, 
one must specify a data extraction procedure and dis- 
tributions for representing objects, which may be cho- 
sen to allow for arbitrary object tracking or tailored to 
a specific object type for a given application. In our 
implementation, we extract data via a basic frame differencing
procedure and specify distributions useful for
representing arbitrary objects (Section 5). We describe 
inference algorithms for our model and show how the 
output of these algorithms can be interpreted as object 
localization and tracking results (Section 6). Our imple- 
mentation is demonstrated on multiple synthetic and 
benchmark datasets. Standard performance metric val- 
ues are computed to quantify results on the benchmark 
datasets (Section 7). We compare the performance met- 
rics of our method with those yielded by specialized 
detection and tracking algorithms tailored to specific 
objects in the benchmark datasets. Our results support 
our hypothesis that approaches combining simple data 
extraction and a powerful model can perform detection 
and tracking of arbitrary objects at a level comparable 
to state-of-the-art, object-specific algorithms.



2 Background 

A variety of methods in the fields of image process- 
ing, signal processing, and computer vision have been 
developed to solve aspects of the problem of unsuper- 
vised detection and tracking of arbitrary objects. These 
methods might be placed into a few broad categories: 
those that aim to distinguish the foreground regions 
of images from the background [33, 12, 65, 39], seg- 
ment images into distinct regions to perform localiza- 
tion [35, 23, 56], track an object over a sequence of im- 
ages (after its position has been specified in an initial 
image) [52, 43, 36, 16, 50], track multiple objects over a 
sequence of images (especially when the objects inter- 
act or occlude one another) [54, 18, 66, 32, 44, 19], seg- 
ment a sequence of images into distinct spatiotemporal 
regions [10, 55, 61], and combine the previous methods 
in some way to create systems capable of both detecting 
and tracking specified objects [47, 7, 67, 38, 41], or of 
discerning which regions of a video constitute distinct, 
arbitrary objects and tracking these [8, 9, 25, 49, 48]. 

Methods that discern between the foreground and 
background regions of a video allow for data to be ex- 
tracted from the areas in each frame where objects re- 
side. Frame differencing and background subtraction 
are two such methods. Both record locations that ex- 
hibit motion relative to the background. Often, back- 
ground subtraction refers to methods that compare an 
image containing targets with an image of the back- 
ground only or with some model of the background that 
is learned as the video progresses [51], while frame dif- 
ferencing refers to methods that compare pairs of con- 
secutive images in a video [64]. Frame differencing has 
been used as the sole extraction method for object localization
or tracking schemes with success [49, 3, 14],
and also as a secondary data extraction method to help
improve the accuracy of object tracking [50].

A great deal of research has focused on develop- 
ing algorithms to track multiple objects simultaneously. 
There has been a particular emphasis on developing 
ways to deal with problems such as object occlusion 
(where one object blocks another from the view of a 
video camera) [54, 18, 66], complex object interactions 
[38, 44, 19], objects with similar appearances [42, 36], 
variable (and potentially high) numbers of objects [53], 
and objects that enter and exit a field of view at differ- 
ent times [58, 46]. Multiple independent single-object 
trackers running simultaneously have been shown to 
be ineffective, as they will tend to coalesce and track 
the same object [38]. To remedy this problem, methods
have incorporated probabilistic principles for maintain- 
ing isolation of object trackers [42]. An approach to this 
problem involving the use of a nonparametric mixture
model has also found success in maintaining isolation 
of distinct objects [60]. 

Over the past decade, there have been attempts to 
provide general algorithms for the fully unsupervised 
detection and tracking of arbitrary objects in videos. 
Blob tracking, a basic method for carrying out this 
goal, has found success in videos where objects are 
easily isolated from the background and where local- 
ization and segmentation of distinct objects is possi- 
ble [26, 34]. Blob tracking methods, however, run into 
problems when faced with videos where detection is dif- 
ficult, object appearance or orientation varies heavily, 
and there exists object occlusion [57]. To improve the 
accuracy of these methods, techniques have been de- 
veloped for performing extraction and segmentation in 
a joint manner, incorporating statistical methods for 
maintaining hypotheses of different numbers of detected 
objects, and introducing some of the multi-object track- 
ing methods described previously to track distinct blobs 
after they have been segmented [15, 34]. Another family 
of methods related to the task of unsupervised detection 
and tracking of video objects goes under the heading of 
video segmentation algorithms — these methods extend 
single-frame image segmentation to maintain coherence 
of image segments over time, and have had some suc- 
cess when used for the explicit purpose of detecting 
and tracking foreground objects in videos [10, 55, 61]. 
Other attempts to perform unsupervised detection and 
tracking include methods for clustering short sequences 
of positions extracted by detecting the motion of ob- 
jects [8, 9], which aim to return full-length distinct ob- 
ject tracks, and graph based methods that carry out a 
similar task using spectral clustering [25]. Another ap- 
proach uses a Gaussian mixture model to cluster data 
extracted from moving objects [49]; this method also 
develops heuristics for the initialization and elimination
of new object tracks. 

Nearly all high accuracy object detection and track- 
ing methods are those tailored for specific object types. 
These methods rely on detection criteria that exploit 
knowledge about the appearance or behavior of the ob- 
jects in a video. Some of these methods make use of 
state-of-the-art detectors designed to locate the spec- 
ified objects of interest. In contrast, the method de- 
scribed in this paper is designed to track arbitrary ob- 
jects without using any explicit detection criteria, and 
serve as a general strategy that can be used, with- 
out modification, to perform accurate detection and 
tracking of diverse objects in a wide range of videos. 
The technique we introduce falls into the category of 
clustering-based arbitrary object detection and track- 
ing methods. Differing from previous work, we use a 
type of time dependent Bayesian nonparametric mix- 
ture model, and show how it can be applied to a variety 
of easily extracted data to perform detection and track- 
ing. This method begins by performing a simple data 
extraction procedure that yields noisy data. Our model 
of this data serves as a general framework for which 
we can choose a variety of object appearance distribu- 
tions and inference algorithms; each choice provides a 
new method for unsupervised detection and tracking of 
arbitrary objects in videos. 



3 Data Extraction 

We desire a data extraction procedure that yields ob- 
servations of the form 

$$x = (x^s, x^c, t) = (x^{s_1}, x^{s_2}, x^{c_1}, \ldots, x^{c_V}, t) \qquad (1)$$

where each $x$ corresponds to a point within an image
region where an object (or foreground element) is believed
to reside, $x^s \in \mathbb{R}^2$ denotes the spatial location of
this point, $x^c \in S_1 \times \cdots \times S_V$ denotes some collection
of local image features in the vicinity of this point, and
$t \in \{1, \ldots, T\}$ denotes the time index.

We would like to use an extraction procedure that is as
unsophisticated as possible. Consequently, we use frame
differencing. This procedure locates the pixels in image
regions that undergo change. Specifically, at each
frame the pixels that differ from the previous frame
beyond some threshold are recorded. Here, each pixel
corresponds to an observation x. Frame differencing is 
simple, computationally inexpensive, and able to be ap- 
plied to a wide range of static, single-camera videos 
(the videos used in experiments were recorded with a stationary
camera; moving-camera videos require data extraction methods useful
for non-stationary video [12, 65]). Examples of pixel location
data extracted via frame differencing are shown
in Figure 1(a-j).
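The frame differencing step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation used in the experiments: the function name `frame_difference`, the list-of-rows grayscale frame format, and the threshold value are all assumptions made for the example.

```python
def frame_difference(prev_frame, curr_frame, t, threshold=25):
    """Return observations (row, col, t) for pixels whose grayscale
    intensity changed by more than `threshold` between two frames.

    Frames are lists of rows of integer intensities (hypothetical format).
    """
    observations = []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (p, q) in enumerate(zip(prev_row, curr_row)):
            if abs(q - p) > threshold:
                observations.append((r, c, t))
    return observations


# Toy 3x4 frames: a bright 1x2 "object" moves one pixel to the right.
frame_a = [[0, 0, 0, 0],
           [200, 200, 0, 0],
           [0, 0, 0, 0]]
frame_b = [[0, 0, 0, 0],
           [0, 200, 200, 0],
           [0, 0, 0, 0]]
moving_pixels = frame_difference(frame_a, frame_b, t=2)
print(moving_pixels)
```

Note that only the leading and trailing edges of the toy object are recorded, since its interior pixel does not change between frames; this is one reason the extracted data is noisy.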

We also extract features $x^c$ that capture image information
in the vicinity of each extracted pixel. Examples
of possible features include color distributions,
pixel intensity values, feature point (such as corner, 
shape, or edge) locations or spatial characteristics, and 
texture representations. In principle, we can extract any 
image features that may be used to characterize the ap- 
pearance of objects. In our implementation, we choose 
to extract only color information in the vicinity of each 
pixel. Incorporating color features allows our method to 
infer a distribution over color for each detected object; 
this improves its ability to distinguish between adjacent 
objects and track objects through occlusion. To add this 
information, we let $x^c$ represent a $V$-dimensional discrete
distribution over some aspect of color (such as the
hue) in the pixel's immediate vicinity. Details on this 
vector and how it is computed are given in Section 7.2. 
We refer to the components of this vector as the "color 
counts" of the pixel. 

4 Model Framework 

We use a type of dependent Dirichlet process mixture 
(DDPM) model known as the Generalized Polya Urn 
dependent Dirichlet process mixture (GPUDDPM). We 
define a general form of this model in Section 4.1, and 
specify the distributions used in our implementation of 
this model in Section 5. We also define a secondary 
form of this model in Section 6.1, which is used in one 
of the two inference algorithms. We provide a brief in- 
troduction to mixture models and the Dirichlet process 
in Appendix A. 

Dirichlet process mixture (DPM) models (Section 
A.4) fall under the heading of Bayesian nonparametric
models. These models have been widely used in the
past decade to perform nonparametric density estima- 
tion and cluster analysis. Data extracted from videos 
comprises spatiotemporal clusters, each corresponding
to a distinct object. Consequently, we are interested in 
using a class of models known as dependent Dirichlet 
process mixture (DDPM) models (Section A.5), which
are particularly useful for estimating the number of la- 
tent classes (clusters) in time dependent data. 

Since objects can enter and exit a scene, the num- 
ber of clusters present throughout the video may not 
be constant (i.e., clusters may be created, or be "born", 
and may disappear, or "die", at intermediate time steps).
To cluster data with these properties, we choose to use 
a DDPM known as the Generalized Polya Urn dependent
Dirichlet process mixture [11]. This model may be
viewed intuitively as a sequence of DPMs, where there
exist dependencies between the parameters and number
of clusters in adjacent time steps. Like the DPM,
the GPUDDPM allows for a distribution over the number
of clusters within a dataset (which, in this work,
corresponds to the number of objects in a video) to be
inferred.

Fig. 1: Three pairs of consecutive frames and the results produced by taking the pixel-wise frame difference
between each pair (a - i). The final image shows the results of frame differencing over a sequence of images (from
the PETS2009/2010 dataset).



4.1 Generalized Polya Urn Dependent Dirichlet 
Process Mixture Model 

In a GPUDDPM, each observation $x_{i,t}$ is associated
with an assignment variable $c_{i,t}$ that represents its
assignment to a cluster $k$ at time $t$. The sizes of clusters in the
GPUDDPM increase when observations are assigned
to them, and decrease at later time points when these
observations become "unassigned". We define a distribution
over the size of cluster $k$ at time $t$ ($m_{k,t}$),
conditioned on the cluster's previous size ($m_{k,t-1}$), the
assignments at time $t$ ($c_{1:N_t,t}$), and a deletion parameter
($\rho$),

$$D(m_{k,t} \mid m_{k,t-1}, c_{1:N_t,t}, \rho) := \mathrm{Binomial}\Big(m_{k,t-1} - m_{k,t} + \sum_{i=1}^{N_t} \mathbb{I}(c_{i,t} = k) \;\Big|\; m_{k,t-1}, \rho\Big) \qquad (2)$$

$\forall k \in \{1, \ldots, K_t\}$, where $K_t$ is the number of clusters
at time $t$ and $\mathbb{I}(c_{i,t} = k)$ is an indicator function whose
value is 1 if $c_{i,t} = k$ and 0 otherwise.

We also define a distribution over the assignment of
observation $i$ at time $t$ ($c_{i,t}$), conditioned on the sizes
of all clusters at time $t$ ($m_{1:K_t,t}$) and a concentration
parameter ($\alpha$),

$$C(c_{i,t} = k \mid m_{1:K_t,t}, \alpha) \propto \begin{cases} \dfrac{m_{k,t}}{\sum_{k'=1}^{K_t} m_{k',t} + \alpha} & \text{if } k \in \{1, \ldots, K_t\} \\[2ex] \dfrac{\alpha}{\sum_{k'=1}^{K_t} m_{k',t} + \alpha} & \text{if } k = K_t + 1 \end{cases} \qquad (3)$$

$\forall i \in \{1, \ldots, N_t\}$, where there exist $K_t$ clusters at time
$t$ and we give a newly created cluster the index $K_t + 1$.
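To make the roles of the two distributions concrete, the following is a minimal simulation sketch of the urn, written in plain Python. It is not the authors' code: the function names are hypothetical, cluster indices are zero-based, and two modelling details are assumptions of the sketch, namely that deletion is applied before the assignments at each time step and that cluster sizes are incremented after every individual assignment.

```python
import random


def delete_step(prev_sizes, rho, rng):
    """Thin each cluster: every one of its m_{k,t-1} members is
    independently unassigned with probability rho, so the number removed
    is Binomial(m_{k,t-1}, rho)."""
    sizes = []
    for m in prev_sizes:
        removed = sum(1 for _ in range(m) if rng.random() < rho)
        sizes.append(m - removed)
    return sizes


def assign_step(sizes, n_obs, alpha, rng):
    """Assign n_obs observations one at a time. An existing cluster k is
    chosen with probability m_k / (sum(m) + alpha), a new cluster with
    probability alpha / (sum(m) + alpha)."""
    sizes = list(sizes)
    assignments = []
    for _ in range(n_obs):
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(0)  # open a new cluster (index K_t + 1 in the text)
        sizes[k] += 1
        assignments.append(k)
    return assignments, sizes


rng = random.Random(0)
sizes = []
for t in range(1, 6):
    sizes = delete_step(sizes, rho=0.3, rng=rng)
    assignments, sizes = assign_step(sizes, n_obs=20, alpha=1.0, rng=rng)
    print(t, sizes)
```

Clusters whose size drops to zero keep their slot in the list but can no longer be selected, which mirrors the "death" of an object.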

Distributions (2) and (3) together comprise what is 
referred to as the "Generalized Polya Urn" [11]. We can 
now define the GPUDDPM generatively as 



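$$\begin{aligned}
m_{k,t} \mid m_{k,t-1}, c_{1:N_t,t}, \rho &\sim D(m_{k,t-1}, c_{1:N_t,t}, \rho) \\
\theta_{k,t} \mid \theta_{k,t-1} &\sim \begin{cases} P(\theta_{k,t} \mid \theta_{k,t-1}) & \text{if } k \le K_t \\ G_0 & \text{if } k = K_t + 1 \end{cases} \\
c_{i,t} \mid m_{1:K_t,t}, \alpha &\sim C(m_{1:K_t,t}, \alpha) \\
x_{i,t} \mid c_{i,t}, \theta_{1:K_t,t} &\sim F(\theta_{c_{i,t},t})
\end{aligned} \qquad (4)$$

$\forall$ times $t \in \{1, \ldots, T\}$ and each cluster $k \in \{1, \ldots, K_t\}$
at time $t$, where we choose application-specific distributions
for $F$, $G_0$, and $P(\theta_{k,t} \mid \theta_{k,t-1})$ in Section 5. The
graphical model associated with this formulation of the
GPUDDPM is shown in Figure 2.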




Fig. 2: Graphical Model of the Generalized Polya Urn
dependent Dirichlet process mixture. The observations
at time $t$, $x_{1:N_t,t}$, and their associated assignments
$c_{1:N_t,t}$ are denoted respectively as $x_t$ and $c_t$ (likewise
for those at time $t - 1$).



5 Model Specifics 

Sections 5.1-5.3 detail the object representation distri- 
butions chosen to fully specify our implementation of 
the GPUDDPM. This specification is used for all experiments
in Section 7. The distributions $F$, $G_0$, and
$P(\theta_{k,t} \mid \theta_{k,t-1})$ represent object appearance, the appearance
prior, and object movement, respectively. Our specification
is kept general to allow for wide applicability,
though one could choose to incorporate known object 
appearance or motion information for a specific track- 
ing application in future studies. 



5.1 Object Appearance and Mixture Component, F 

At a given time $t$, we model each observation $x \in X$ as
a draw from the product of a multivariate normal and
multinomial distribution

$$F(x \mid \theta) = \mathcal{N}(x^s \mid \mu, \Sigma) \, \mathcal{M}n(x^c \mid p) \qquad (5)$$

where $\theta = \{\mu, \Sigma, p\}$ denotes the parameters of a cluster
at time $t$, with mean $\mu \in \mathbb{R}^2$, covariance matrix $\Sigma \in \mathbb{R}^{2 \times 2}$,
and discrete probability vector $p = (p_1, \ldots, p_V)$
such that $\sum_{i=1}^{V} p_i = 1$. Additionally, $\mathcal{N}$ denotes the
multivariate normal distribution and $\mathcal{M}n$ denotes the
multinomial distribution. Note that we omit
the subscripts specifying the cluster number $k$ and
time $t$ in this section when writing them is unnecessary.

The multivariate normal distribution over the spatial
features $x^s$ models the position and spatial extent
of an object. Intuitively, it represents
the shape of each object as an oval. We model the color
features $x^c$ as draws from a multinomial distribution.
Incorporating this distribution into our cluster likeli- 
hood allows us to exploit our observation that the pix- 
els associated with distinct objects tend to have similar 
color count vectors. 
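As an illustration of how such a component likelihood can be evaluated, the sketch below computes the log of a two-dimensional normal density times a multinomial probability in plain Python. The function names and the nested-list covariance format are assumptions of this example, and the closed-form 2x2 inverse is used only because the spatial features are two-dimensional.

```python
import math


def log_mvn_2d(xs, mu, sigma):
    """Log-density of a 2-D normal; sigma is [[a, b], [b, d]]."""
    (a, b), (_, d) = sigma
    det = a * d - b * b
    dx, dy = xs[0] - mu[0], xs[1] - mu[1]
    # Quadratic form computed with the closed-form 2x2 inverse.
    quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def log_multinomial(counts, p):
    """Log-pmf of a multinomial for a vector of color counts."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coef + sum(c * math.log(pj) for c, pj in zip(counts, p) if c > 0)


def log_F(x_s, x_c, mu, sigma, p):
    """Log of the component likelihood: spatial term plus color term."""
    return log_mvn_2d(x_s, mu, sigma) + log_multinomial(x_c, p)


# A pixel at the cluster mean with a single count in the first color bin.
example = log_F((0.0, 0.0), (1, 0), (0.0, 0.0),
                [[1.0, 0.0], [0.0, 1.0]], (0.5, 0.5))
print(example)
```

Working in log space avoids underflow when many pixel likelihoods are multiplied together.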



5.2 Appearance Prior and Base Distribution, $G_0$

$G_0$ denotes the base distribution of the DDPM; it also
serves as a prior distribution for the parameters $\theta = \{\mu, \Sigma, p\}$
of the mixture components (i.e., of the object
appearance distributions). We use conjugate priors in
the base distribution to allow for more efficient computation.
Specifically, in our implementation, a normal-inverse-Wishart
prior is placed on the multivariate normal
parameters $\{\mu, \Sigma\}$, and a Dirichlet prior is placed
on the multinomial parameter $p$. The prior can therefore
be written

$$G_0(\theta) = \mathcal{NIW}(\mu, \Sigma \mid \mu_0, \kappa_0, \nu_0, \Lambda_0) \, \mathcal{D}ir(p \mid q_0) \qquad (6)$$

where $\mathcal{NIW}$ denotes the normal-inverse-Wishart distribution,
$\mathcal{D}ir$ denotes the Dirichlet distribution, and the
prior has the hyperparameters $\mu_0$, $\kappa_0$, $\nu_0$, $\Lambda_0$, and $q_0$.
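For readers who want to see what a draw from such a prior looks like, here is a hedged sketch in plain Python. It assumes an integer $\nu_0 \ge 2$ so that the Wishart draw can be built as a sum of outer products, and the helper names (`inv_2x2`, `sample_mvn_2d`, `sample_G0`) are hypothetical, not taken from the paper.

```python
import math
import random


def inv_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def sample_mvn_2d(mean, cov, rng):
    """Draw from N(mean, cov) using a hand-written 2x2 Cholesky factor."""
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [mean[0] + l11 * e1, mean[1] + l21 * e1 + l22 * e2]


def sample_G0(mu0, kappa0, nu0, lambda0, q0, rng):
    # Sigma ~ inverse-Wishart(nu0, Lambda0): build a Wishart draw as a sum
    # of nu0 outer products of N(0, Lambda0^{-1}) vectors, then invert it.
    lam_inv = inv_2x2(lambda0)
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(nu0):
        z = sample_mvn_2d([0.0, 0.0], lam_inv, rng)
        for i in range(2):
            for j in range(2):
                w[i][j] += z[i] * z[j]
    sigma = inv_2x2(w)
    # mu | Sigma ~ N(mu0, Sigma / kappa0)
    scaled = [[s / kappa0 for s in row] for row in sigma]
    mu = sample_mvn_2d(mu0, scaled, rng)
    # p ~ Dirichlet(q0), via normalized Gamma draws.
    g = [rng.gammavariate(q, 1.0) for q in q0]
    total = sum(g)
    p = [gi / total for gi in g]
    return mu, sigma, p


mu, sigma, p = sample_G0([0.0, 0.0], 1.0, 5,
                         [[1.0, 0.0], [0.0, 1.0]],
                         [1.0, 1.0, 1.0], random.Random(0))
print(mu, sigma, p)
```

The hyperparameter values in the usage line are arbitrary and chosen only so that the example runs.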



5.3 Motion Model and Transition Kernel, $P(\theta_t \mid \theta_{t-1})$

The transition kernel $P(\theta_t \mid \theta_{t-1})$ represents how we expect
tracked objects to move over time. Since our implementation
is intended for tracking arbitrary objects, we
do not wish to make sophisticated assumptions about
object motion. For example, we choose not to incorporate
complex object dynamics, though they are often
used with success in certain object-specific tracking
tasks, such as people tracking [13, 17]. We assume only 
that the position of an object at a given time is close to 
its position at the previous time, and that the position 
varies in all directions equally between time steps. 

The base distribution $G_0$ must be the invariant distribution
of the transition kernel $P(\theta_t \mid \theta_{t-1})$ in order for
the cluster parameters to remain marginally distributed
according to the base distribution, and for the
model to be a valid GPUDDPM. In other words, the
transition kernel must satisfy

$$\int G_0(\theta_{t-1}) \, P(\theta_t \mid \theta_{t-1}) \, d\theta_{t-1} = G_0(\theta_t) \qquad (7)$$

for a given cluster with parameters $\theta$. One way to achieve
this is through the use of auxiliary variables. These are
a set of $M$ variables $z_t = (z_{t,1}, \ldots, z_{t,M})$ associated
with each cluster at each time $t$ that satisfy

$$P(\theta_t \mid \theta_{t-1}) = \int P(\theta_t \mid z_t) \, P(z_t \mid \theta_{t-1}) \, dz_t \qquad (8)$$

With the addition of these variables, the parameters
of a cluster at a given time do not depend directly
on their value at the previous time; they are instead
dependent through an intermediate sequence of variables.
This allows the cluster parameters at each time
step to be marginally distributed according to the base
distribution $G_0$ while maintaining simple time varying
behavior.

Each of the auxiliary variables $z_{t,m}$ is drawn from
the product of a multivariate normal and multinomial
with the associated cluster parameters $\theta_t = \{\mu_t, \Sigma_t, p_t\}$

$$z_{t,m} \mid \mu_t, \Sigma_t, p_t \sim \mathcal{N}(\mu_t, \Sigma_t) \, \mathcal{M}n(p_t) \qquad (9)$$

$\forall m \in \{1, \ldots, M\}$. To satisfy (8), we specify the dependencies
of a given cluster on its associated set of
auxiliary variables at each time $t$ by

$$\mu_t, \Sigma_t, p_t \mid z_t \sim \mathcal{NIW}(\mu_M, \kappa_M, \nu_M, \Lambda_M) \, \mathcal{D}ir(q_M) \qquad (10)$$

where $\mu_M$, $\kappa_M$, $\nu_M$, $\Lambda_M$, and $q_M$ are

$$\kappa_M = \kappa_0 + M \qquad (11)$$

$$\nu_M = \nu_0 + M \qquad (12)$$

$$\mu_M = \frac{\kappa_0}{\kappa_0 + M} \mu_0 + \frac{M}{\kappa_0 + M} \bar{z}^s \qquad (13)$$

$$\Lambda_M = \Lambda_0 + S_{z^s} \qquad (14)$$

$$q_M = q_0 + \sum_{m=1}^{M} z^c_{t,m} \qquad (15)$$

and where $M$ is the number of auxiliary variables, $\{\mu_0, \kappa_0, \nu_0, \Lambda_0\}$
are the $\mathcal{NIW}$ prior parameters, and $q_0$
is the $\mathcal{D}ir$ prior parameter. We use $z^s$ and $z^c$ to
respectively denote the spatial and color features of an
auxiliary variable $z$, and $\bar{z}$ and $S_z$ to respectively
denote the sample mean and sample covariance for a set
$z = \{z_1, \ldots, z_M\}$ (of auxiliary variables, in this case),
which we can write as

$$\bar{z} = \Big( \sum_{m=1}^{M} z_m \Big) / M \qquad (16)$$




(17) 
5.4 Recap of Model Parameters

The multivariate normal-multinomial GPUDDPM for
object tracking has a number of parameters, which are
used to specify the object appearance prior distribution,
transition kernel, and Generalized Polya Urn distributions
($C$ and $D$). The object appearance prior parameters
include:

$\mu_0 \in \mathbb{R}^2$: Mean position prior. In experiments
performed in Section 7 the data was
recentered to the origin, and this parameter
was set to (0, 0).

$\kappa_0 \in \mathbb{R}$: Scale factor of the mean prior.

$\Lambda_0 \in \mathbb{R}^{2 \times 2}$: Shape factor of the covariance prior.

$\nu_0 \in \mathbb{Z}_+$, $\nu_0 > 1$: Scale factor of the covariance prior.

$q_0$: Scale factor of the multinomial prior.

The following parameter dictates characteristics of object
movement.

$M$: Number of auxiliary variables.
A larger number will produce a
smoother object path.

Additionally, one can tune the model's tendency to detect
new objects and maintain the existence of these
objects (both dictated by distributions $C$ and $D$) with
the following parameters.

$\alpha$: The concentration parameter for the
Dirichlet process. A higher value will
increase the tendency for new objects
to be detected.

$\rho \in (0, 1]$: The deletion parameter. A higher
value will give objects an increased
tendency to die off.



6 Inference

Bayesian inference is used to achieve detection and tracking
results. Previously developed inference strategies
can be applied to the generative model defined in Section
4.1 and Section 5. We provide details on the two
Bayesian inference algorithms implemented in this study.
The first is a type of Markov Chain Monte Carlo
(MCMC) batch inference, which uses Gibbs sampling
to generate samples from the posterior distribution of
the model. The second is a type of Sequential Monte
Carlo (SMC) inference, also known as a particle filter,
which generates samples from the posterior distribution
of the model in a sequential manner.



6.1 MCMC: Batch Inference

This section details the MCMC sampler used to perform
inference. A secondary formulation of the GPUDDPM,
which we refer to as the "Deletion Variable Formulation"
(defined in Section 6.1.1), is used here. This
formulation is equivalent to the formulation given in
Section 4.1, but allows for easier sampling.

6.1.1 Deletion Variable Formulation

Instead of incorporating the cluster size variable $m_{k,t}$
directly in the GPUDDPM model (as it was in the
definition given by (4)), we can formulate an equivalent
model which makes use of a new set of variables
called deletion variables. We introduce a deletion variable
$d_{i,t}$ for each observation $x_{i,t}$, which denotes the
time at which the observation is removed from its assigned
cluster. At each time, cluster sizes $m_{k,t}$ can be
reconstructed from all previous assignments and deletion
variables by

$$m_{k,t} = \sum_{t'=1}^{t} \sum_{i=1}^{N_{t'}} \mathbb{I}\big[ (c_{i,t'} = k) \wedge (t < d_{i,t'}) \big] \qquad (18)$$

where $\mathbb{I}[\cdot]$ is an indicator function that evaluates to 1
if its argument is true, and 0 otherwise. Additionally,
we can define the deletion variable $d_{i,t}$ to be $d_{i,t} = t + l_{i,t}$,
where $l_{i,t}$ can be thought of as the lifetime of an
assignment. From the definition of distribution $D$ (given
by (2)), the lifetime can be shown to be distributed
geometrically, and can be written as

$$l_{i,t} \mid \rho \sim \rho (1 - \rho)^{l_{i,t}} \qquad (19)$$

where the parameter $\rho$ is the same as that in (2).
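The bookkeeping behind deletion variables can be sketched as follows. This plain-Python example is illustrative only: the record format `(t_obs, cluster, deletion_time)` and both function names are hypothetical, and the sampler assumes a geometric lifetime supported on {0, 1, 2, ...} with the deletion time set to $t + 1$ plus the lifetime, which is one possible reading of the support convention.

```python
import math
import random


def sample_deletion_time(t, rho, rng):
    """Draw d = t + 1 + lifetime, with lifetime ~ Geometric(rho) on
    {0, 1, 2, ...}, using the inverse-CDF method."""
    if rho >= 1.0:
        return t + 1
    u = 1.0 - rng.random()  # uniform on (0, 1]
    lifetime = int(math.log(u) / math.log(1.0 - rho))
    return t + 1 + lifetime


def cluster_size(k, t, records):
    """Count observations assigned to cluster k that are still alive at
    time t. `records` holds (t_obs, cluster, deletion_time) triples."""
    return sum(1 for t_obs, c, d in records
               if t_obs <= t and c == k and t < d)


records = [(1, 0, 3), (1, 0, 2), (2, 0, 5), (2, 1, 4)]
print(cluster_size(0, 2, records))  # observations of cluster 0 alive at t = 2
print(sample_deletion_time(5, 0.3, random.Random(0)))
```

Storing a deletion time per observation means cluster sizes never need to be sampled directly; they are a deterministic function of the assignments and deletion times.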

We can now define the Deletion Variable Formulation
of the GPUDDPM generatively as

$$\begin{aligned}
d_{i,t} \mid \rho &\sim \mathrm{Geo}(\rho) + t + 1 \\
\theta_{k,t} \mid \theta_{k,t-1} &\sim \begin{cases} P(\theta_{k,t} \mid \theta_{k,t-1}) & \text{if } k \le K_t \\ G_0 & \text{if } k = K_t + 1 \end{cases} \\
c_{i,t} \mid c_{1:t-1}, d_{1:t-1}, \alpha &\sim C(c_{1:t-1}, d_{1:t-1}, \alpha) \\
x_{i,t} \mid c_{i,t}, \theta_{c_{i,t},t} &\sim F(\theta_{c_{i,t},t})
\end{aligned} \qquad (20)$$

$\forall$ times $t \in \{1, \ldots, T\}$ and clusters $k \in \{1, \ldots, K_t\}$
at time $t$, where $c_t = c_{1:N_t,t}$, $d_t = d_{1:N_t,t}$, and the
distributions $F$, $G_0$, and $P(\theta_{k,t} \mid \theta_{k,t-1})$ are described in
Section 5. This formulation of the GPUDDPM is also
used by [28] and [11]. The associated graphical model
for this formulation is given in Figure 3.

For easier notation, we define $x_t = x_{1:N_t,t}$, $c_t = c_{1:N_t,t}$,
$d_t = d_{1:N_t,t}$, $\Theta_t = \theta_{1:K_t,t}$, and $Z_t = z_{1:K_t,t,1:M}$.

Fig. 3: Graphical model of the Deletion Variable Formulation
of the GPUDDPM, showing auxiliary variables.



At each time $t$, the sampler moves sequentially through
the $N_t$ observations, sampling the assignment $c_{i,t}$ and
the deletion variable $d_{i,t}$ for each. Afterwards, the cluster
parameters for all active clusters at $t$ are sampled,
and the $M$ auxiliary variables for all active clusters at $t$
are sampled using Metropolis-Hastings (MH). The following
sections detail the distributions from which each
of these samples is drawn.

6.1.2 Sampling Assignment Variables, $c_{i,t}$

A value proportional to the posterior probability can be
computed for each possible value that $c_{i,t}$ may take on.
These values allow us to construct a discrete probability
distribution from which we can draw samples from the
posterior distribution over assignments. The possible
values that $c_{i,t}$ may take on are $k \in \{1, \ldots, K'_t + 1\}$,
where $K'_t$ denotes the number of clusters with a non-zero
size at any time $t' \in \{t, \ldots, d_{i,t}\}$, and $K'_t + 1$ denotes a
"new" cluster. The probability that $x_{i,t}$ is assigned to
cluster $k$, i.e. that $c_{i,t} = k$, given values for all other
variables in the model (which we denote as "..."), is
given by

$$p(c_{i,t} = k \mid \ldots) \propto \prod_{i'=i}^{N_t} C(c_{i',t} \mid m_t, \alpha) \times \prod_{t'=t+1}^{d_{i,t}} \prod_{i'=1}^{N_{t'}} C(c_{i',t'} \mid m_{t'}, \alpha) \times \begin{cases} F(x_{i,t} \mid \theta_{k,t}) & \text{if } k \le K'_t \\ \int P(x_{i,t} \mid \theta) \, G_0(\theta) \, d\theta & \text{if } k = K'_t + 1 \end{cases} \qquad (21)$$

where $C$ is given by (3), and the cluster sizes $m_{t'}$ are
calculated under the assumption that $c_{i,t} = k$. Note
that the above integral has an analytic solution for our
specific model

$$\int P(x_i \mid \theta) \, G_0(\theta) \, d\theta = t_{\nu_0 - 1}\Big( x^s_i \,\Big|\, \mu_0, \frac{\Lambda_0 (\kappa_0 + 1)}{\kappa_0 (\nu_0 - 1)} \Big) \times \prod_{j=1}^{V} \frac{\Gamma(x^c_j)}{\Gamma(q_{0,j})} \times \frac{\Gamma\big( \sum_{j=1}^{V} q_{0,j} \big)}{\Gamma\big( \sum_{j=1}^{V} x^c_j \big)} \qquad (22)$$

where $t$ denotes the multivariate t-distribution, where
we follow the three-value parameterization (location parameter,
scale parameter, and degrees of freedom) given
in [30, 62], and $\{\mu_0, \kappa_0, \Lambda_0, \nu_0, q_0\}$ are prior parameters.

If a new cluster is sampled as an assignment, the 
cluster parameters and auxiliary variables for this new 
cluster must be initialized for all time steps before sam- 
pling can proceed. In our implementation, newly sam- 
pled clusters were initialized by iteratively sampling for- 
ward to time T and backwards to time 1 via the tran- 
sition kernel. 

6.1.3 Sampling Cluster Parameters, $\theta_{k,t}$

The conjugacy of the appearance model and transition kernel
distributions allows us to easily sample from the posterior
distribution over the cluster parameters, which
we can write

$$\begin{aligned}
P(\theta_{k,t} \mid \ldots) &\propto P(x_{1:N_t,t} \mid \theta_{k,t}) \, P(z_{k,t+1,1:M} \mid \theta_{k,t}) \times P(\theta_{k,t} \mid z_{k,t,1:M}) \\
&= \mathcal{NIW}(\mu_{k,t}, \Sigma_{k,t} \mid \mu_N, \kappa_N, \nu_N, \Lambda_N) \times \mathcal{D}ir(p_{k,t} \mid q_N)
\end{aligned} \qquad (23)$$

where the parameters in the above distribution are
given when the observations $x_{1:N_t,t}$, and auxiliary variables
$z_{k,t-1:t,1:M}$ for cluster $k$ at time $t - 1$ and $t$, are
taken to be the "observations" for the following Bayesian
updates

$$\kappa_N = \kappa_0 + N \qquad (24)$$

$$\nu_N = \nu_0 + N \qquad (25)$$

$$\mu_N = \frac{\kappa_0}{\kappa_0 + N} \mu_0 + \frac{N}{\kappa_0 + N} \bar{x}^s \qquad (26)$$

$$\Lambda_N = \Lambda_0 + S_{x^s} \qquad (27)$$

$$q_N = q_0 + \sum_{i=1}^{N} x^c_i \qquad (28)$$

where $N$ is the number of observations, $\{\mu_0, \kappa_0, \nu_0, \Lambda_0\}$
are the $\mathcal{NIW}$ prior parameters, $q_0$ is the $\mathcal{D}ir$ prior parameter,
$x^s$ and $x^c$ respectively denote the spatial and
color features of the observations, and $\bar{x}$ and $S_x$ respectively
denote the sample mean and sample covariance
for the set of observations $x$, defined in (16) and (17).
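A small worked sketch of these conjugate updates is given below in plain Python. Several points are assumptions of the sketch rather than statements about the original implementation: $S$ is treated as the unnormalized scatter matrix of the spatial features, the function name is hypothetical, and the optional `mean_shift_term` flag adds the term $\frac{\kappa_0 N}{\kappa_0 + N}(\bar{x}^s - \mu_0)(\bar{x}^s - \mu_0)^T$ that appears in the standard textbook normal-inverse-Wishart update but is not visible in (27) as printed here. Setting the flag to `False` follows (27) literally.

```python
def niw_dirichlet_update(obs_s, obs_c, mu0, kappa0, nu0, lambda0, q0,
                         mean_shift_term=True):
    """Posterior hyperparameters given spatial features obs_s (2-D points)
    and color-count vectors obs_c."""
    n = len(obs_s)
    mean = [sum(x[i] for x in obs_s) / n for i in range(2)]
    # Unnormalized scatter matrix of the spatial features (assumed S).
    scatter = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in obs_s)
                for j in range(2)] for i in range(2)]
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = [(kappa0 * mu0[i] + n * mean[i]) / kappa_n for i in range(2)]
    lambda_n = [[lambda0[i][j] + scatter[i][j] for j in range(2)]
                for i in range(2)]
    if mean_shift_term:
        w = kappa0 * n / kappa_n
        for i in range(2):
            for j in range(2):
                lambda_n[i][j] += w * (mean[i] - mu0[i]) * (mean[j] - mu0[j])
    # Dirichlet update: add the per-bin color counts to the prior.
    q_n = [q + sum(x[v] for x in obs_c) for v, q in enumerate(q0)]
    return mu_n, kappa_n, nu_n, lambda_n, q_n


mu_n, kappa_n, nu_n, lambda_n, q_n = niw_dirichlet_update(
    obs_s=[(1.0, 1.0), (3.0, 1.0)], obs_c=[(1, 0), (0, 1)],
    mu0=(0.0, 0.0), kappa0=2, nu0=3,
    lambda0=[[1.0, 0.0], [0.0, 1.0]], q0=(1, 1))
print(mu_n, kappa_n, nu_n, lambda_n, q_n)
```

In the sampler described in the text, the "observations" passed to such an update would include the relevant auxiliary variables as well as the pixels assigned to the cluster.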
6.1.4 Sampling Auxiliary Variables, $z_{k,t,m}$

The posterior distribution over each of the auxiliary
variables $z_{k,t,m}$ can be written as

$$P(z_{k,t,m} \mid \ldots) \propto P(z_{k,t,m} \mid \theta_{k,t-1}) \, P(\theta_{k,t} \mid z_{k,t,1:M}) \qquad (29)$$

We sample a new value for all $M$ auxiliary variables,
denoted $z^*_{k,t,m}$, using MH, with the proposal distribution

$$z^*_{k,t,m} \sim P(z_{k,t,m} \mid \theta_{k,t}) = \mathcal{N}(z^s_{k,t,m} \mid \mu_{k,t}, \Sigma_{k,t}) \, \mathcal{M}n(z^c_{k,t,m} \mid p_{k,t}) \qquad (30)$$

and compute the standard MH acceptance ratio, which
in this case simplifies to

$$a_{\text{accept}} = \frac{P(\theta_{k,t-1} \mid z^*_{k,t,m})}{P(\theta_{k,t-1} \mid z_{k,t,m})} \qquad (31)$$



6.1.5 Sampling Deletion Variables, $d_{i,t}$

Sampling deletion variables could be performed in a
manner similar to how we sample the assignment variables
in Section 6.1.2, but this may be computationally
expensive due to the large number of possible deletion
times. To remedy this, the MH algorithm may again be
used to generate samples $d^*_{i,t}$ from the posterior distribution
over possible deletion times, where we use the
proposal distribution

$$l_{i,t} \sim \mathrm{Geo}(\rho), \qquad d^*_{i,t} = l_{i,t} + t + 1 \qquad (32)$$

where $l_{i,t}$ denotes the geometrically distributed "lifetime",
and we accept or reject this sample using the
process described in Section 6.1.4.
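The accept/reject mechanics shared by both MH updates can be sketched generically. The example below is a plain-Python illustration of one Metropolis-Hastings step with an independence proposal (a proposal that does not depend on the current state, as with the geometric lifetime draw). The function `mh_step`, the toy target, and the parameter values are hypothetical and stand in for the model-specific posterior terms.

```python
import math
import random


def mh_step(current, propose, log_target, log_proposal, rng):
    """One Metropolis-Hastings step with an independence proposal."""
    candidate = propose(rng)
    log_ratio = (log_target(candidate) - log_target(current)
                 + log_proposal(current) - log_proposal(candidate))
    # Accept with probability min(1, exp(log_ratio)).
    if math.log(1.0 - rng.random()) < log_ratio:
        return candidate, True
    return current, False


# Toy usage: propose deletion times d = t + 1 + Geometric(rho) lifetime.
t, rho = 3, 0.4


def propose(rng):
    u = 1.0 - rng.random()
    return t + 1 + int(math.log(u) / math.log(1.0 - rho))


def log_proposal(d):
    return math.log(rho) + (d - t - 1) * math.log(1.0 - rho)


def log_target(d):
    return -abs(d - 6)  # hypothetical stand-in for the true posterior


rng = random.Random(0)
state = t + 1
for _ in range(100):
    state, _ = mh_step(state, propose, log_target, log_proposal, rng)
print(state)
```

Because only one candidate is scored per step, the cost no longer grows with the number of possible deletion times, which is the motivation given in the text for using MH here.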



6.2 SMC: Sequential Inference 

This section details a Sequential Monte Carlo (SMC)
sampler, also known as a particle filter, used to perform
inference. SMC inference operates in the original
GPUDDPM formulation (Section 4.1). The algorithm is
shown in Algorithm 1. At each time step $t \in \{1, \ldots, T\}$,
a number of samples referred to as "particles" are generated;
each particle consists of a sample from the posterior
distribution over the assignment for each observation,
$c_{1,t}, \ldots, c_{N_t,t}$, parameters for each cluster, $\theta_{1,t}, \ldots, \theta_{K_t,t}$,
and size after deletion for each cluster, $m_{1,t}, \ldots, m_{K_t,t}$.
A set of particles is sampled at each time step
from relevant proposal distributions (described in Sections
6.2.1, 6.2.2, and 6.2.3), a weight is computed for
each particle, and a new set of particles is sampled
from the set of weighted particles via a resampling process.
Within each time step, Gibbs sampling is used to
generate the samples associated with each particle.

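The sequence of target distributions for the SMC
algorithm may be written as

$$
\begin{aligned}
\pi_t(c_{1:t}, \Theta_{1:t}, m_{1:t}) ={}& \pi_{t-1}(c_{1:t-1}, \Theta_{1:t-1}, m_{1:t-1}) \\
&\times \prod_{i=1}^{N_t} P(c_{i,t} \mid m_t, \Theta_t, c_{1:t}, x_{1:N_t}) \\
&\times \prod_{k=1}^{K_t}
\begin{cases}
P(\theta_{k,t} \mid \theta_{k,t-1}) & \text{if } k \le K_{t-1} \\
G_0(\theta_{k,t}) & \text{if } k > K_{t-1}
\end{cases} \\
&\times \prod_{k=1}^{K_t} D(m_{k,t} \mid m_{k,t-1}, c_{1:N_{t-1},t-1}, \rho)
\end{aligned}
\tag{33}
$$

where $c_t = c_{1:N_t,t}$, $\Theta_t = \theta_{1:K_t,t}$, and $m_t = m_{1:K_t,t}$, and
$D$ is given by (2).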

6.2.1 Proposal Distribution for Assignments 

The probability of assignments given current cluster
sizes, cluster parameters, and the Dirichlet process
concentration parameter $\alpha$ can be written as

$$
P(c_{i,t} \mid m_{1:K_t,t}, \theta_{1:K_t,t}, \alpha) \propto C(m_{1:K_t,t}, \alpha) \times
\begin{cases}
P(x_{i,t} \mid \theta_{c_{i,t},t}) & \text{if } k \le K_{t-1} \\
\int P(x_{i,t} \mid \theta)\, G_0(\theta)\, d\theta & \text{if } k > K_{t-1}
\end{cases}
\tag{34}
$$

where $C$ is defined in (3), and $\int P(x_{i,t} \mid \theta)\, G_0(\theta)\, d\theta$ can
be determined analytically and is given in (22).

6.2.2 Proposal Distribution $q_1$

The following is a distribution over cluster parameters
$\theta_{k,t}$ given a set of observations $x_{1:N,t}$. We define $q_1$
to be

$$q_1(\theta_{k,t} \mid x_{1:N,t}) = P(\theta_{k,t} \mid x_{1:N,t}) \tag{35}$$

where samples can be drawn from $P(\theta_{k,t} \mid x_{1:N,t})$ using
the Bayesian updates found in (24), (25), (26), and (28),
where the $x_{1:N,t}$ are taken to be the observations.

6.2.3 Proposal Distribution $q_2$

The following is a distribution over cluster parameters
$\theta_{k,t}$ given a set of $N$ observations $x_{1:N,t}$ and the cluster
parameters at a previous time, $\theta_{k,t-1}$. We define $q_2$ to
be

$$q_2(\theta_{k,t} \mid \theta_{k,t-1}, x_{1:N,t}) = P(\theta_{k,t} \mid \theta_{k,t-1}, x_{1:N,t}) \tag{36}$$
Algorithm 1 Sequential Monte Carlo Inference for the GPUDDPM

for $l = 1 : L$ do
    $w_0^{(l)} \leftarrow 1/L$                                  ▷ initialize weights
    $K_0^{(l)} \leftarrow 0$                                    ▷ initialize # of clusters
end for
for $t = 1 : T$ (# of frames) do
    for $l = 1 : L$ (# of particles) do
        $\theta^{(l)}_{1:K^{(l)}_{t-1},t} \leftarrow \theta^{(l)}_{1:K^{(l)}_{t-1},t-1}$
        for $s = 1 : S$ (# of Gibbs samples) do
            for $i = 1 : N_t$ (# of observations at frame $t$) do
                if $s = 1$ then
                    Sample $c^{(l)}_{i,t} \sim P(c_{i,t} \mid \ldots)$          ▷ eq. (34)
                else
                    Sample $c^{(l)}_{i,t} \sim P(c_{i,t} \mid \ldots)$          ▷ eq. (34)
                end if
                if $c^{(l)}_{i,t} = K^{(l)}_t + 1$ (a new cluster) then
                    $K^{(l)}_t \leftarrow K^{(l)}_t + 1$
                    Sample $\theta^{(l)}_{K^{(l)}_t,t} \sim q_1(x_{i,t})$        ▷ eq. (35)
                end if
            end for
            for $k = 1 : K^{(l)}_t$ (# of clusters at frame $t$) do
                if $k > K^{(l)}_{t-1}$ then
                    Sample $\theta^{(l)}_{k,t} \sim q_1(\{x_{1:N_t,t} : c^{(l)}_{1:N_t,t} = k\})$        ▷ eq. (35)
                else if $k \le K^{(l)}_{t-1}$ and $\#\{x_{1:N_t,t} : c^{(l)}_{1:N_t,t} = k\} > 0$ then
                    Sample $\theta^{(l)}_{k,t} \sim q_2(\theta^{(l)}_{k,t-1}, \{x_{1:N_t,t} : c^{(l)}_{1:N_t,t} = k\})$        ▷ eq. (36)
                else if $m^{(l)}_{k,t} > 0$ then
                    Sample $\theta^{(l)}_{k,t} \sim P(\theta_{k,t} \mid \theta^{(l)}_{k,t-1})$        ▷ eqs. (9) & (10)
                end if
                if $s = S$ then
                    Sample $m^{(l)}_{k,t+1} \sim D(m^{(l)}_{k,t}, c^{(l)}_{1:N_t,t}, \rho)$        ▷ eq. (2)
                end if
            end for
        end for
        $w^{(l)}_t \leftarrow \dfrac{P(x_{1:N_t,t}, c^{(l)}_{1:N_t,t} \mid \ldots)}{P(c^{(l)}_{1:N_t,t} \mid m^{(l)}, \theta^{(l)}, x_{1:N_t,t}, \ldots)}$
    end for
    for $l = 1 : L$ do
        $\tilde{w}^{(l)}_t \leftarrow w^{(l)}_t / \sum_{l'=1}^{L} w^{(l')}_t$        ▷ normalize weights
    end for
    Resample particles $1, \ldots, L$ and weights $w^{(1)}_t, \ldots, w^{(L)}_t$        ▷ Section 6.3
end for

where samples can be drawn from $P(\theta_{k,t} \mid \theta_{k,t-1}, x_{1:N,t})$
using the Bayesian updates found in (24), (25), (26),
and (28), where both the $x_{1:N,t}$ and auxiliary variables
$z_{k,t,1:M}$ are taken to be the observations.

6.3 Resampling Particles and Particle Weights 

At each time step $t \in \{1, \ldots, T\}$, after all of the $L$ particles
have been sampled and their associated weights
computed, a resampling step is carried out. In this step,
$L$ new particles are sampled from the current set of $L$ particles.
Resampling strategies such as those described in
[28] and [21] can be used.
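
As one concrete possibility, the sketch below implements systematic resampling, a widely used low-variance scheme. It is offered only as an illustration: the text does not say which strategy from [28] or [21] was actually used, and the function name `systematic_resample` is hypothetical.

```python
import random


def systematic_resample(weights, rng):
    """Return the indices of the particles kept by systematic resampling.

    weights: non-negative particle weights (need not be normalized).
    A single uniform offset is drawn, then L evenly spaced points are
    swept across the cumulative weight distribution.
    """
    n = len(weights)
    total = sum(weights)
    u0 = rng.random() / n            # single random offset in [0, 1/n)
    indices = []
    i = 0
    cumulative = weights[0] / total  # normalized cumulative weight so far
    for j in range(n):
        u = u0 + j / n               # j-th evenly spaced point
        # Advance to the first particle whose cumulative weight exceeds u.
        while u >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices
```

The returned indices select which particles survive; after resampling, every surviving particle would typically be given the equal weight $1/L$.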

6.4 From Inference to Tracking Results 

Each inferred cluster is taken to be a distinct object, 
and the sequence of means and covariance matrices for 
a given cluster are used to determine the position and 
spatial region, respectively, of a given object over a se- 
quence of time steps. In particular, the mean param- 
eter is taken to be the centroid of an object, and a 
2-dimensional oval centered on the mean that contains 
a specified percentage of the normal distribution mass 
(where we refer to the specified percentage as the con- 
fidence value) is taken to be the spatial region of the 
object. We report the maximum a posteriori (MAP) 
sample as our result. 

7 Experiments 

This section provides details on performance evaluation 
metrics that have been developed to quantify results 
in object detection and tracking studies (and adopted 
in this paper), synthetic video experiments that ver- 
ify aspects of the developed technique, and benchmark 
video experiments that demonstrate the performance 
of this technique in relation to other strategies (includ- 
ing state-of-the-art, object-specific strategies) that have 
been developed in recent years. 

7.1 Performance Evaluation Metrics 

Performance evaluation metrics, which provide a stan- 
dardized way of quantifying the success of a detection 
and tracking procedure on a given video, have started 
to become consistently used in the past four years. The 
metrics presented in [37] and used in [22, 59, 40] have 
become well established for evaluating the performance 
of object detection and tracking in videos and have been 



adopted by the Video Analysis and Content Extraction 
(VACE) program and the Classification of Events, Ac- 
tivities, and Relationships (CLEAR) consortium, two 
large-scale efforts concerned with video tracking and 
interaction analysis. The two metrics used to quantify 
the experimental results in this study are known as the 
Sequence Frame Detection Accuracy (SFDA) and Av- 
erage Tracking Accuracy (ATA). Details on how these 
metrics are defined and computed are given in
Appendix B.

The above metrics are dependent upon ground-truth 
data specifying the positions of each object in each 
frame throughout a video sequence. In the experiments 
described below, we recorded the synthetic video ground- 
truth during construction of the videos (described in 
Section 7.3), and we used the Video Performance Eval- 
uation Resource (ViPER) ground-truth software [20], 
an open source tool commonly used in the video track- 
ing community, to author ground-truth data for each of 
the benchmark datasets. 

The ground-truth authored by the ViPER tool took 
the form of bounding boxes denoting the spatial
position of each object at each time step. Consequently,
to find the spatial overlap between results and ground- 
truth, which is intrinsic to both metrics, a rectangular 
bounding box was needed per object per time step from 
the results of the algorithm. We took the maximal and 
minimal axially aligned values of the oval inferred by 
our algorithm (as described in Section 6.4) to be the 
sides of a representative bounding box for a given ob- 
ject at a given frame. 
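
The conversion from an inferred cluster to a bounding box can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name is hypothetical, the covariance is taken to be a 2 x 2 nested list, and the oval is assumed to be the Mahalanobis contour of the bivariate normal containing the requested mass. For a bivariate normal, the mass inside Mahalanobis radius $r$ is $1 - e^{-r^2/2}$, and the axis-aligned extent of that contour along dimension $j$ is $r\sqrt{\Sigma_{jj}}$.

```python
import math


def ellipse_bounding_box(mean, cov, confidence):
    """Axis-aligned bounding box of a 2-D normal confidence oval.

    mean: (x1, x2) centroid; cov: 2x2 covariance as nested lists;
    confidence: fraction of normal mass inside the oval, 0 < confidence < 1.
    Returns (x1_min, x2_min, x1_max, x2_max).
    """
    # Squared Mahalanobis radius enclosing the requested mass (2 dimensions).
    r2 = -2.0 * math.log(1.0 - confidence)
    # The extreme points of the oval along each axis depend only on the
    # diagonal entries of the covariance matrix.
    half_w = math.sqrt(r2 * cov[0][0])
    half_h = math.sqrt(r2 * cov[1][1])
    return (mean[0] - half_w, mean[1] - half_h,
            mean[0] + half_w, mean[1] + half_h)
```

Note that correlation between the two spatial dimensions rotates the oval but does not change its axis-aligned extents, so only the two variances enter the box.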

7.2 Data Extraction in Experiments 

Frame differencing was used in all experiments to identify
pixels exhibiting motion. For each pixel $x = (x_1, x_2, t)$
recorded during frame differencing, we also extracted
color information. Specifically, we specified a
square, $L$ pixels in length, centered on $(x_1, x_2)$, that
contained a set of pixels surrounding $x$ in frame $t$. We
chose to capture the hue for each pixel. The set of pos- 
sible hue values (i.e. the range of hues to which a pixel 
may be assigned) was partitioned into V bins, and the 
number of pixels with a color value lying in each of the 
bins yielded the V dimensional vector of color counts. 
For all experiments, we chose V = 10. 
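
The color-count extraction can be illustrated with a short sketch. Several details are assumptions made for the example rather than facts from the text: hues are taken to lie in [0, 1), `hue_frame[row][col]` indexing is used with $x_1$ as the column and $x_2$ as the row, the window length is odd, and window pixels falling outside the frame are skipped. The function name `hue_histogram` is hypothetical.

```python
def hue_histogram(hue_frame, x1, x2, window=3, bins=10):
    """Count hues in a window x window square centered on pixel (x1, x2).

    hue_frame: 2-D nested list of hue values in [0, 1), indexed [row][col].
    window: side length L of the square (assumed odd).
    bins: number of hue bins V.
    Returns a list of V counts, one per hue bin.
    """
    half = window // 2
    n_rows, n_cols = len(hue_frame), len(hue_frame[0])
    counts = [0] * bins
    for row in range(x2 - half, x2 + half + 1):
        for col in range(x1 - half, x1 + half + 1):
            # Assumption: pixels outside the frame are simply ignored.
            if 0 <= row < n_rows and 0 <= col < n_cols:
                b = min(int(hue_frame[row][col] * bins), bins - 1)
                counts[b] += 1
    return counts
```

With $L = 3$ and $V = 10$, each motion pixel away from the frame border yields a 10-dimensional count vector whose entries sum to 9.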

7.3 Synthetic Video Datasets 

Each of the following synthetic videos consists of a se- 
quence of 200 images (each of size 500 x 500 pixels) con- 
taining a number of smaller colored squares of different 
Fig. 4: Each plot shows a sample from the posterior distribution of the model for synthetic experiments one (a and
b) and two (c), where the vertical axis represents frame number, the horizontal axes represent spatial position, 
objects are denoted by marker colors and marker types, and the mean and standard deviation are shown. In all 
cases, the objects are successfully tracked through occlusion, whether they travel in a straight line (a), reverse 
direction (b), or do a combination of both (c). 



(and potentially time-varying) sizes moving at varied 
speeds and trajectories over a black background. The 
synthetic videos contain instances of occlusion (where 
one or more objects are briefly hidden) and objects with 
time-varying appearances and behaviors, as these notoriously
decrease the accuracy of detection and tracking.
After each video was constructed, the extraction proce- 
dure described in Section 7.2 (using L = 3) and infer- 
ence procedures described in Section 6 were carried out 
to return a sequence of multivariate normal parameters
(means and covariance matrices), which are used to
determine a sequence of positions and ovals approximating,
respectively, the locations and shapes of a tracked
object over each frame that it is present in the video 
(as outlined in Section 6.4). 

The first synthetic video experiment aimed to test 
the ability of the model and inference procedure to 
maintain the identity of independent objects based on 
color information alone. Two videos were constructed, 
both containing a red square (rgb value [255,0,0] and
size 20 x 20 pixels) and a blue square (rgb value [0,0,255]
and size 20 x 20 pixels). In both videos, the squares
begin at opposite sides of the scene at frame $f = 1$ and
travel towards each other, arriving at the same
location at $f = 100$ (where the blue square occludes the
red square). The second halves of the two videos differ
in that both squares in the first video continue in the
same direction and end at the other's starting position
at $f = 200$, and both squares in the second video
reverse directions and end at their initial starting
positions at $f = 200$. The frame difference extraction yields



identical spatial features in both videos; hence, success- 
ful tracking depends fully on the incorporation of color 
information into the model. 
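
The claim that the two videos are spatially indistinguishable can be checked with a small sketch. The coordinates below are hypothetical (the text does not give the squares' pixel positions), motion is simplified to one dimension, and positions are assumed to be piecewise linear between frames 1, 100, and 200.

```python
def interpolate(keyframes, f):
    """Piecewise-linear position at frame f from sorted (frame, position) pairs."""
    for (f0, p0), (f1, p1) in zip(keyframes, keyframes[1:]):
        if f0 <= f <= f1:
            return p0 + (p1 - p0) * (f - f0) / (f1 - f0)
    raise ValueError("frame outside keyframe range")


# Hypothetical x-coordinates of the two starting points and the meeting point.
LEFT, MID, RIGHT = 0.0, 240.0, 480.0

# Video 1: the squares pass through each other; video 2: they reverse.
video1 = {"red": [(1, LEFT), (100, MID), (200, RIGHT)],
          "blue": [(1, RIGHT), (100, MID), (200, LEFT)]}
video2 = {"red": [(1, LEFT), (100, MID), (200, LEFT)],
          "blue": [(1, RIGHT), (100, MID), (200, RIGHT)]}


def footprint(video, f):
    """Set of occupied positions at frame f, ignoring which square is where."""
    return frozenset(interpolate(track, f) for track in video.values())


# True when the color-blind spatial features agree at every frame.
identical = all(footprint(video1, f) == footprint(video2, f)
                for f in range(1, 201))
```

Under these assumptions `identical` is true even though the red square ends at opposite sides of the scene in the two videos, which is why only the color features can disambiguate the tracks.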

Parameters were set to the same values for inference
on both videos: $\alpha = 0.1$, $\rho = 0.3$, $M = 10$, $\mu_0 = (0, 0)$,
$\kappa_0 = 0.05$, $\nu_0 = 5$, $\Lambda_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, and $q_0 = (5, \ldots, 5)$.
Inference was carried out using the MCMC algorithm 
(Section 6.1); the MAP sample correctly tracked both 
colored squares through occlusion in both videos, and 
is shown in Figure 4. 

The second synthetic video experiment aimed to 
test tracking performance under occlusion, object ap- 
pearance change, and motion change. A video was con- 
structed showing a red square (rgb value [255,0,0]), a 
green square (rgb value [0,255,0]), and a blue square 
(rgb value [0,0,255]). The red square was of size 20
x 20 pixels, the blue square was of size 15 x 15 pixels,
and the green square began at size 50 x 50 pixels
at frame $f = 1$, linearly shrank to 10 x 10 pixels at
$f = 100$, then linearly grew back to 50 x 50 pixels by
the end of the video, $f = 200$. Furthermore, the red and
blue squares display the same behavior as in the second
video of the first synthetic experiment (they begin at
opposite sides of the scene traveling towards each other,
cross at the center of the scene at $f = 100$, and reverse
direction, ending at their initial positions at $f = 200$).
The green square begins at a point equidistant from the
other two squares, intersects with them as they overlap
(causing the blue square to occlude the other two), and
continues on in a direction at a 20 degree angle from its
initial trajectory.
Parameters were set to the same values chosen in
the first synthetic experiment. The MCMC inference
algorithm correctly tracked all three objects through
occlusion and inferred the appearance and size shifts.
Figure 4 shows a sample from the posterior distribution
of the cluster parameters, where the mean and oval
representation of the covariance matrix (with 0.5
confidence value) are overlaid on the data. The data are
plotted with time on the vertical axis, and the assignment
of each data point to one of the three inferred
clusters is denoted by color and marker type.

7.4 Benchmark Video Datasets 

Benchmark video datasets for object tracking and de- 
tection have been produced to provide standard scenes 
on which researchers can compare detection and tracking
results. These videos have been primarily produced
for surveillance-related workshops, notably the
International Workshop on Performance Evaluation of
Tracking and Surveillance (PETS), which provide
researchers with video datasets and algorithmic goals on
which to focus. Three commonly used benchmark videos
from PETS workshops (one used in PETS2000, one
in PETS2001, and one used both in PETS2009 and
PETS2010) were chosen to demonstrate the method
presented in this study. The performance metrics and 
benchmark datasets allow the methods developed in 
this paper to be quantitatively compared against other 
detection and tracking algorithms. 

7.4.1 PETS2000 and PETS2001

The PETS2000 and PETS2001 video datasets both con- 
sist of a small number of humans and vehicles travel- 
ing across a parking lot, with video taken from above, 
emulating what might be recorded by standard out- 
door surveillance equipment. The "Test Sequence", a 
set of images from a monocular, stationary camera, was 
used from the PETS2000 workshop, and "View Two 
of Dataset 1", also taken via a monocular, stationary
camera, was used from the PETS2001 workshop. The 
MCMC algorithm (described in Section 6.1) was used 
for inference in these experiments. 

Due to the computation required for the MCMC
batch inference method (discussed further in Section 8),
only the final 1000 frames of the video were used from
both datasets. Extraction was performed with frame
differencing as described in Section 7.2, using $L = 3$.
Parameter values were set to the same values as in the
synthetic experiments ($\alpha = 0.1$, $\rho = 0.3$, $M = 10$, $\mu_0 = (0, 0)$,
$\kappa_0 = 0.05$, $\nu_0 = 5$, $\Lambda_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, and $q_0 = (5, \ldots, 5)$).
The MCMC sampler was successful for both
benchmark videos; each object was detected, tracked,
and its shape estimated in a manner very consistent with
the ground-truth. The results for the PETS2000 dataset
are displayed in Figure 5a and for the PETS2001 dataset
in Figure 5b; in these figures, a sample from the posterior
distribution of the cluster parameters is overlaid
on the extracted data over a sequence of frames, where
the assignment of each data point is represented by
color and marker type.

To calculate performance metrics (both SFDA and 
ATA), one must specify a confidence value that allows 
the oval representing the region occupied by an object 
to be computed from the inferred covariance matrix of 
each cluster (as discussed in Section 6.4). The perfor- 
mance metrics were found for a range of confidence in- 
tervals, and the resulting curves for both the PETS2000 
and PETS2001 video are shown in Figure 6. 

7.4.2 PETS2009/2010 

A video dataset used in both the PETS2009 and PETS- 
2010 conferences, called "S2.L1 at time sequence 12.34" 
was chosen for experimentation due to its prominence 
in a number of studies [22, 2, 4, 17, 5, 6, 29, 1, 63]. This 
dataset consists of a 794-frame video sequence from a monocular,
stationary camera. The entire video sequence was
used in this experiment. 

Due to the large number of frames and objects in 
this video, the SMC algorithm (described in Section 6.2) 
was used for inference. This method of sequential infer- 
ence was observed, on this dataset, to converge to a 
better sample in a shorter period of time in comparison 
with the MCMC algorithm. 

Extraction was performed with frame differencing as
described in Section 7.2, using $L = 3$, and parameters
for the model were chosen to be $\alpha = 0.1$, $\rho = 0.8$, $M = 10$,
$\mu_0 = (0, 0)$, $\kappa_0 = 0.05$, $\nu_0 = 6$, $\Lambda_0 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, and
$q_0 = (3, \ldots, 3)$. Additionally, as with the other video datasets,
ground-truth bounding boxes around each object were 
authored using the ViPER tool. 

The SMC inference algorithm yielded an estimate 
of the posterior distribution of the model, from which 
the object detection and tracking results were obtained 
(as described in Section 6.4). In Figure 7, the MAP 
sample from the posterior distribution over the cluster 
parameters is overlaid on the extracted data over a
sequence of frames, where the assignment of each data 
point is represented by color and marker type. 

7.4.3 Comparison with Other Methods

In [22], performance metrics (including the SFDA and 
ATA) were computed for a number of studies that car- 
ried out object detection and tracking for the PETS20- 

Fig. 5: Results from the PETS2000 (a) and PETS2001 (b) dataset. Both plots show a sample from the posterior 
distribution of the state, where the vertical axis denotes time, the horizontal axes represent spatial position, color 
represents assignment, and the mean and standard deviation are shown. Below are four frames from the PETS2000 
(c-f) and PETS2001 (g-j) sequence with one posterior sample mean and covariance matrix representation shown 
for each frame (and one sample mean shown for the previous 20 frames). 



09/2010 dataset. As this dataset consists solely of hu- 
mans, all ten of the algorithms presented for compar- 
ison were developed for the specific purpose of people 
tracking (i.e. not for general detection and tracking of 
arbitrary objects). As a consequence, many of these 
studies use externally developed (and trained) state- 
of-the-art human detectors, exploit the orientation of 
the humans in this specific dataset, or apply motion 



models based on assumptions about human motion. In 
particular, Breitenstein et al. [6] base their tracking on
output from an externally trained human-specific detector;
Yang et al. [63] assume they are tracking an upright
person, and perform feet and head detection; Conte et 
al. [17] group foreground fragments based on geome- 
try of the human shape to be recognized and look for 
shadows often present in human surveillance scenarios; 
Fig. 6: The SFDA (blue solid line) and ATA (red dashed line) vs confidence values from which an object's oval
region is computed for (a) PETS2000 and (b) PETS2001 video datasets. 



Berclaz et al. [4] use an external detector that makes 
use of multiple camera views and models each human 
as a cylinder; Alahi et al. [1] base their method on mod- 
elling the silhouettes of humans; Bolme et al. [5] train 
a human specific detector; Ge et al. [29] provide their 
algorithm with estimates of typical human size and ori- 
entation; and Arsic et al. [2] localize human feet posi- 
tions. 

We compare SFDA and ATA results of our strat- 
egy with these methods to show that our arbitrary ob- 
ject framework can yield comparable results even when 
compared with object-specific trackers. Table 1 shows 
performance metric results for comparison (data pub- 
lished with permission from the authors of [22]). Our 
method achieves the fourth best SFDA and third best 
ATA. 



Method Name         SFDA    ATA
Breitenstein [6]    0.57    0.30
Yang [63]           0.55    0.45
Conte [17]          0.53    0.06
GPUDDPM             0.51    0.30
Berclaz [4]         0.48    0.15
Alahi 1 [1]         0.43    0.04
Alahi 2 [1]         0.42    0.05
Bolme 1 [5]         0.41    NA
Ge [29]             0.38    0.04
Bolme 2 [5]         0.34    NA
Arsic [2]           0.18    0.02


Table 1: SFDA and ATA performance metric results are 
shown for our method (in bold) and for ten other algo- 
rithms on the PETS2009/2010 benchmark dataset. Re- 
sults are listed in descending order of the SFDA value. 
The results were provided by the authors of [22]. 



7.4.4 Sensitivity Analysis

SMC inference on the PETS2009/2010 video dataset
was carried out for a range of the Generalized Polya
Urn parameter values, $\alpha$ and $\rho$. The performance metric
measures, SFDA and ATA, were computed for each
combination of these two parameters. This sensitivity
investigation focused on these parameters due to their
potential to have a large effect on object detection
accuracies. The $\alpha$ values tested included {0.01, 0.1, 1, 10, 100},
and the $\rho$ values tested included {0.7, 0.75, 0.8, 0.85, 0.9}.

The SFDA and ATA achieve their maximal values
at different $\alpha$ and $\rho$ parameters, though both achieve a
near-optimal value at the intermediate parameter
values $\alpha = 10$ and $\rho = 0.85$. Detection and tracking
performance was also shown to be fairly robust to minor
variations in these parameter values.



8 Conclusion 

We have presented a new model for the unsupervised 
detection and tracking of arbitrary objects in videos. 
The primary intention of this technique is to reduce 
the need for detection or localization methods tailored 
to specific object types and serve as a general frame- 
work applicable to videos with varied objects, back- 
grounds, and film qualities. The GPUDDPM, a time- 
dependent Dirichlet process mixture, has been intro- 
duced, and we have shown how inference on this model 
Fig. 7: (a) Results for the PETS2009/2010 dataset, showing a sample from the posterior distribution of the
state for frames 1-50, where the vertical axis denotes time, the horizontal axes represent spatial position, color
represents assignment, and the mean and standard deviation are shown; (b) performance metrics vs. covariance
confidence interval threshold. Below (c-f) are four frames with one posterior sample mean and covariance matrix
representation shown for each frame (and one sample mean shown for the previous 20 frames).



allows us to achieve detection and tracking results. Fur- 
thermore, we have demonstrated a specific implemen- 
tation of the model using spatial and color pixel data 
extracted via frame differencing and provided two algorithms
for performing Bayesian inference on the model
to accomplish detection and tracking. Both algorithms 
were carried out on multiple synthetic and benchmark 
multi-object video datasets in order to demonstrate an 
ability to accomplish unsupervised detection and track- 
ing of arbitrary objects in both manufactured and real 
world settings. We have described and computed stan- 
dard performance metrics for our technique's detec- 
tion and tracking results, and found it to be compa- 
rable with state-of-the-art object-specific detection and 
tracking methods designed for people tracking in the 
PETS2009/2010 video dataset. Results from the syn- 



thetic and benchmark video datasets illustrate the abil- 
ity of the technique described in this paper to, without 
modification, perform completely unsupervised detec- 
tion and tracking of objects with diverse physical char- 
acteristics moving over non-uniform backgrounds and 
through occlusion. 
A - Appendix: Model Background

We give background on mixture models, Bayesian mixture
models, the Dirichlet process, Dirichlet process (infinite) mixture
models, and dependent Dirichlet process mixture models.

A.1 Finite Mixture Model

A finite mixture model can be thought of as a probability
distribution for an observation $x_i$ formulated as a linear
combination of $K$ mixture components (which we also refer to
as 'clusters'), where each mixture component is a probability
distribution for $x_i$ with some parametric form, and the
coefficients of the linear combination sum to one. The finite
mixture model can be written as

$$P(x_i) = \sum_{k=1}^{K} P(c_i = k)\, P(x_i \mid \theta_k) \tag{37}$$

$\forall i \in \{1, \ldots, N\}$, where $c_i \in \{1, \ldots, K\}$ denotes the assignment
of $x_i$ to a given mixture component and $\theta_k$ denotes the parameters of the
$k$th mixture component. Note that by choosing $P(c_i = k)$ as
coefficients of the linear combination, it is ensured that these
coefficients sum to one. We also define $p_k := P(c_i = k)$ for
$k \in \{1, \ldots, K\}$. We can therefore write this model generatively
as

$$
\begin{aligned}
c_i \mid p_1, \ldots, p_K &\sim \mathrm{Discrete}(p_1, \ldots, p_K) \\
x_i \mid c_i &\sim F(\theta_{c_i})
\end{aligned}
\tag{38}
$$

$\forall i \in \{1, \ldots, N\}$, where the $x_i$ are the observations, the $c_i$ are the
mixture component assignments associated with each observation,
the $\theta_{c_i}$ are parameters defining the $c_i$th mixture component
(i.e. the distribution to be mixed, $F(\theta_{c_i})$), and the
"Discrete" distribution refers to a multinomial distribution
whose outcome is a 1-of-$K$ vector (i.e. a vector of counts
that sums to one).

A.2 Bayesian (Finite) Mixture Model

The finite mixture model of Section A.1 can be extended to a
Bayesian mixture model by viewing parameters that were previously
point values, $\theta_{c_i}$ (the mixture component parameters)
and $p_1, \ldots, p_K$ (the mixture component assignment weights),
as random variables and providing each with a prior distribution.
In this case, the prior distribution $G_0$ is placed on
the mixture component parameters, and the prior distribution
$\mathrm{Dir}(\alpha/K, \ldots, \alpha/K)$ is placed on the mixture component
assignment weights. The resulting Bayesian mixture model
can be formulated generatively as

$$
\begin{aligned}
p_1, \ldots, p_K &\sim \mathrm{Dir}(\alpha/K, \ldots, \alpha/K) \\
\theta_1, \ldots, \theta_K &\sim G_0 \\
c_i \mid p_1, \ldots, p_K &\sim \mathrm{Discrete}(p_1, \ldots, p_K) \\
x_i \mid c_i, \theta_1, \ldots, \theta_K &\sim F(\theta_{c_i})
\end{aligned}
\tag{39}
$$

$\forall i \in \{1, \ldots, N\}$, where the $x_i$ are the observations, the $c_i$ are the
mixture component assignments associated with each observation,
the $\theta_k$ are parameters defining the $k$th mixture component
(i.e. the distribution to be mixed, $F(\theta_k)$), the $\theta_k$ are
drawn from a prior distribution $G_0$, and $p_1, \ldots, p_K$ are drawn
from a Dirichlet prior parameterized by $\alpha/K, \ldots, \alpha/K$.

A.3 Dirichlet Process

The Dirichlet process (DP), first introduced by [24] in 1973,
may be intuitively viewed as a probability distribution over
discrete probability distributions. Accordingly, draws from a
DP are probability mass functions (PMFs). A DP is parameterized
by a base distribution $G_0$, which is a probability distribution
over a set $\Theta$, and a concentration parameter $\alpha \in \mathbb{R}_+$.
We say that $G$ is a random PMF distributed according to a
DP, written $G \sim \mathrm{DP}(\alpha, G_0)$, if the following holds for all finite
partitions $A_1, \ldots, A_p$ of $\Theta$:

$$(G(A_1), \ldots, G(A_p)) \sim \mathrm{Dir}(\alpha G_0(A_1), \ldots, \alpha G_0(A_p)) \tag{40}$$

where 'Dir' denotes a Dirichlet distribution. The parameters
$G_0$ and $\alpha$ may be intuitively viewed as the mean and precision
of the DP. This is due to the fact that if the base distribution
$G_0$ is a distribution over $\Theta$, $A \subset \Theta$, and $G \sim \mathrm{DP}(\alpha, G_0)$, then
the following holds:

$$E[G(A)] = G_0(A) \tag{41}$$

$$\mathrm{Var}[G(A)] = G_0(A)(1 - G_0(A))/(\alpha + 1) \tag{42}$$

Hence, the expectation of $G(A)$ is $G_0(A)$, the variance of $G(A) \to 0$
as $\alpha \to \infty$, and $G$ converges pointwise to $G_0$ as $\alpha$
grows without bound.
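
Properties (41) and (42) can be checked numerically on a fixed partition, since by (40) the vector of masses $G(A_1), \ldots, G(A_p)$ is Dirichlet distributed. The sketch below is illustrative only; the function names are hypothetical, and Dirichlet draws are generated by normalizing independent gamma variates, a standard construction.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one Dirichlet sample by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def dp_partition_moments(alpha, base_masses):
    """Analytic mean and variance of G(A_j) from eqs. (41) and (42)."""
    means = list(base_masses)
    variances = [g * (1.0 - g) / (alpha + 1.0) for g in base_masses]
    return means, variances


def empirical_moments(alpha, base_masses, n_draws, rng):
    """Monte Carlo mean and variance of G(A_j) using eq. (40)."""
    params = [alpha * g for g in base_masses]
    samples = [sample_dirichlet(params, rng) for _ in range(n_draws)]
    p = len(params)
    means = [sum(s[j] for s in samples) / n_draws for j in range(p)]
    variances = [sum((s[j] - means[j]) ** 2 for s in samples) / n_draws
                 for j in range(p)]
    return means, variances
```

For example, with $\alpha = 4$ and base masses (0.3, 0.5, 0.2), the analytic variances are 0.042, 0.05, and 0.032; increasing $\alpha$ shrinks all of them towards zero, which is the sense in which $\alpha$ acts as a precision.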



A.4 Dirichlet Process (Infinite) Mixture Model



A DPM model, also referred to as an infinite mixture model, is
an extension of the Bayesian mixture model described in
Section A.2. When using a DP as a prior in a Bayesian mixture
model, $\Theta$ represents the set of parameters of the component
mixture distributions. A DPM may be viewed as allowing the
prior distribution over the mixture component parameters in
a standard mixture model to be distributed according to a
DP; this allows for modeling data where the true number of
latent mixture components is unknown and arbitrarily large
by letting the number of components remain unbounded (note
that only a finite number of these components are assigned to
the data). In particular, the DPM can be defined generatively
as

$$
\begin{aligned}
G &\sim \mathrm{DP}(\alpha, G_0) \\
\phi_i \mid G &\sim G \\
x_i \mid \phi_i &\sim F(\phi_i)
\end{aligned}
\tag{43}
$$

$\forall i \in \{1, \ldots, N\}$, where the $x_i$ are observations, the $\phi_i$ are
parameters defining the mixture component from which the
$i$th observation is drawn (i.e. the distribution to be mixed,
$F(\phi_i)$), and the $\phi_i$ are drawn from a prior distribution $G$,
which is in turn drawn from a DP with base distribution
$G_0$ and parameter $\alpha$. See [27] and [28] for more details on
this formulation. Note the difference between the indexing of
the clusters in this model and the indexing in the previous
two models. This formulation can be shown to be equivalent
to the Bayesian mixture model defined in (39), when $K$ is
taken to be unbounded; as a result, this model is sometimes
called an infinite mixture model. If we let $K$ be the number of
distinct mixture components assigned to observations using
the above model, we can write the mixture components as
$\theta_1, \ldots, \theta_K$. We also let $c_1, \ldots, c_N$ (where $c_i \in \{1, \ldots, K\}$) be
class assignment variables that indicate the cluster to which
observation $x_i$ is assigned.
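
The remark that only a finite number of components are assigned to the data can be illustrated with the finite model (39) for large $K$. Integrating out the weights $p_1, \ldots, p_K$ gives the predictive rule $P(c_i = k \mid c_{1:i-1}) = (n_k + \alpha/K)/(i - 1 + \alpha)$, where $n_k$ counts earlier assignments to component $k$. The sketch below samples assignments from this rule; the function name is hypothetical, and components are relabeled in order of first use, which is harmless because the prior is symmetric in the component labels.

```python
import random


def sample_assignments(n_obs, n_components, alpha, rng):
    """Sample c_1..c_N from model (39) with the weights integrated out.

    Components are labeled 0, 1, 2, ... in order of first use.
    """
    counts = {}        # occupied component label -> number of members
    assignments = []
    for i in range(n_obs):
        # Total predictive mass is (i + alpha) after i earlier observations.
        u = rng.random() * (i + alpha)
        chosen = None
        for k, n_k in counts.items():
            u -= n_k + alpha / n_components
            if u < 0:
                chosen = k
                break
        if chosen is None:
            if len(counts) < n_components:
                chosen = len(counts)      # open a previously unused component
            else:
                chosen = len(counts) - 1  # rounding fallback: all occupied
        counts[chosen] = counts.get(chosen, 0) + 1
        assignments.append(chosen)
    return assignments
```

With, say, 200 observations, $K = 10000$, and $\alpha = 1$, a typical draw occupies only a handful of components, far fewer than $K$, consistent with the behavior described for the unbounded case.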
A.5 Dependent Dirichlet Process Mixture Model

The goal of DDPM models is to allow modeling of data that 
is not independent and identically distributed but instead 
has some underlying dependencies. For example, data gener- 
ated during video extraction procedures have some associated 
temporal dependencies, since there exist similarities between 
features (such as those that encode the spatial positions or 
appearances of objects) of data at nearby time steps. 

To account for the dependent behavior of data, there has 
been research into models involving a sequence of DPMs, 
where components of the mixtures are dependent upon (or 
are sometimes said to be "tied to") corresponding compo- 
nents at neighboring positions in the sequence. For example, 
if the data shows temporal dependence, the goal might be to 
create a sequence of DPMs, one for each time-step, where the 
components of the mixture at each step are dependent upon 
corresponding components in both the following and
previous time steps.

More rigorously, we take the definition of a DDPM to be 
a stochastic process defined on the space of probability distri- 
butions over a domain, which are indexed by time, space, or a 
selection of other covariates in such a way that the marginal 
distribution at any point in the domain follows a Dirichlet 
process (adapted from definitions found in [28] and [31]). 
Hence, a time-dependent DDPM is a model which remains 
a Dirichlet process, marginally, at each time step, yet allows 
cluster parameters at a given time step to vary from (and 
remain dependent upon) the parameters in neighboring time 
steps. 



B - Appendix: Performance Metric Details 

This section provides details on the definition and calculation 
of the performance evaluation metrics, SFDA and ATA, used 
to quantify detection and tracking results in this study. 



B.1 Mapping Ground-Truth to Output

The problem of finding a mapping between a video's ground- 
truth tracks and an algorithm's output tracks is nontrivial, 
but it must be solved in order to compute the performance
evaluation metrics used in this study. In short, the
typical solution to this problem involves first specifying a per- 
formance metric and then choosing the mapping from ground- 
truth tracks to output tracks which yields the most favorable 
performance metric value. This process is described in de- 
tail by Kasturi et al. [37]; we follow the method outlined in 
that paper to find an optimal mapping. Similar to the descriptions
in [37], we implement the Hungarian algorithm [45] as a
polynomial-time ($O(n^3)$) solution to the problem of optimally
mapping two sets of tracks once the similarity between any 
two tracks given some specified metric is established. Addi- 
tionally, the method employed in [37] allows erroneous and 
undetected tracks to be left unmapped, which is both desired 
and necessary in the case where there is a different number of 
ground-truth and output tracks. Note that once a mapping 
from a collection of ground-truth tracks to a collection of re- 
sult tracks has been established, one can determine which 
result tracks are false positives (the result tracks to which
no ground-truth track is assigned) and which ground-truth
tracks are false negatives, or misses (the ground-truth tracks
that are not assigned to a result track). The number of tracks
exhibiting each of these failures is a factor in the performance
metrics used in this study.
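
The mapping step described above can be sketched as follows. This is a minimal illustration and not the implementation used in this study: the function names `hungarian` and `map_tracks` are hypothetical, the similarity matrix is assumed to be precomputed from whichever metric is being optimized, and dropping zero-similarity pairs is an assumed stand-in for the way [37] leaves erroneous and undetected tracks unmapped.

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list where entry r is the column assigned to row r.
    Runs in O(n^2 * m) time using the potentials formulation.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-indexed)
    v = [0.0] * (m + 1)          # column potentials (1-indexed)
    p = [0] * (m + 1)            # p[j]: row matched to column j, 0 if free
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column: path is complete
                break
        while j0 != 0:           # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assign = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assign[p[j] - 1] = j - 1
    return assign


def map_tracks(similarity):
    """Map ground-truth tracks (rows) to output tracks (columns).

    similarity[g][d] >= 0 is the score between ground-truth track g and
    output track d. Returns sorted (g, d) pairs that maximize the total
    similarity; pairs with zero similarity are left unmapped (assumption).
    """
    n_gt, n_out = len(similarity), len(similarity[0])
    transpose = n_gt > n_out     # the solver needs rows <= columns
    if transpose:
        cost = [[-similarity[g][d] for g in range(n_gt)] for d in range(n_out)]
    else:
        cost = [[-s for s in row] for row in similarity]
    assign = hungarian(cost)
    pairs = [(c, r) if transpose else (r, c) for r, c in enumerate(assign)]
    return sorted((g, d) for g, d in pairs if similarity[g][d] > 0)
```

Given the returned pairs, output tracks that appear in no pair are the false positives, and ground-truth tracks that appear in no pair are the missed tracks.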



B.2 SFDA and ATA 

The two metrics used to quantify performance in this study 
are known as the Sequence Frame Detection Accuracy
(SFDA) and the Average Tracking Accuracy (ATA). These 
metrics were developed during VACE Phase II to provide a 
single, comprehensive metric to describe detection, and one 
to describe tracking. The following are used in the definitions 
of the performance metrics: 

• $G_i$ denotes the spatiotemporal region occupied by the ith
ground-truth object in a video, and $G_i^{(t)}$ denotes the
region occupied by the ith ground-truth object in frame t.

• $D_i$ denotes the spatiotemporal region occupied by the ith
detected object in a video, and $D_i^{(t)}$ denotes the region
occupied by the ith detected object in frame t.

• $N_G$ denotes the total number of unique ground-truth objects
in a video, and $N_G^{(t)}$ denotes the number of unique
ground-truth objects present at frame t.

• $N_D$ denotes the total number of unique detected objects
in a video, and $N_D^{(t)}$ denotes the number of unique
detected objects present at frame t.

• $N_{\mathrm{frames}}$ denotes the total number of frames in a video,
and $N_{\mathrm{frames}}^{i}$ denotes the number of frames in which an
object i (which can be a ground-truth or detected object,
depending on the context) is present in a video.

• $N_{\mathrm{mapped}}$ denotes the number of mapped ground-truth/
detect pairs in a video, and $N_{\mathrm{mapped}}^{(t)}$ denotes the number
of mapped ground-truth/detect pairs present at frame t.

The SFDA metric quantifies the performance of an object
detection algorithm as a function of the number of correct
detects, false positive detects, missed (false negative) detects,
and spatial alignment of detects relative to the ground-truth.
The SFDA is calculated by computing the Frame Detection
Accuracy at frame t ($\mathrm{FDA}^{(t)}$) for each frame in a
video sequence. The FDA provides a measure of the alignment
between ground-truth and detected objects in a given frame
via the overlap ratio of a ground-truth/detect pair, defined
to be the ratio of the intersection of ground-truth and detect
regions to the union of ground-truth and detect regions.
Formally, we can write



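$$\mathrm{FDA}^{(t)} = \frac{\text{Overlap Ratio}}{\left( N_G^{(t)} + N_D^{(t)} \right) / 2} \qquad (44)$$

where

$$\text{Overlap Ratio} = \sum_{i=1}^{N_{\mathrm{mapped}}^{(t)}} \frac{\left| G_i^{(t)} \cap D_i^{(t)} \right|}{\left| G_i^{(t)} \cup D_i^{(t)} \right|} \qquad (45)$$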



The term $N_{\mathrm{mapped}}^{(t)}$ refers to an optimal mapping between
ground-truth and detects at frame t as specified in section B.1
using the $\mathrm{FDA}^{(t)}$ as the relevant metric. Given the
$\mathrm{FDA}^{(t)}$ at each frame, the SFDA can be computed; this
metric may be viewed as the average FDA over all frames of a video
sequence. We define

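$$\mathrm{SFDA} = \frac{\sum_{t=1}^{N_{\mathrm{frames}}} \mathrm{FDA}^{(t)}}{\sum_{t=1}^{N_{\mathrm{frames}}} \exists \left( N_G^{(t)} \lor N_D^{(t)} \right)} \qquad (46)$$

where $\exists \left( N_G^{(t)} \lor N_D^{(t)} \right)$ yields a 1 if either a detected or
ground-truth object is present in frame t and a 0 otherwise.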
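
The detection metric can be sketched in a few lines. This is an illustration under stated assumptions and not the code used in this study: regions are represented as sets of pixel coordinates, the function names are hypothetical, the per-frame mapping is assumed to be given, and the FDA is normalized by the average of the per-frame ground-truth and detect counts, following the definition in Kasturi et al. [37].

```python
def overlap_ratio(gt_region, det_region):
    """Intersection over union of two regions given as sets of (x, y) pixels."""
    union = gt_region | det_region
    return len(gt_region & det_region) / len(union) if union else 0.0


def fda(mapped_pairs, n_gt, n_det):
    """Frame Detection Accuracy for a single frame.

    mapped_pairs: list of (gt_region, det_region) for the mapped pairs.
    n_gt, n_det:  number of ground-truth and detected objects in the frame.
    """
    if n_gt + n_det == 0:
        return 0.0               # empty frame contributes nothing
    total_overlap = sum(overlap_ratio(g, d) for g, d in mapped_pairs)
    return total_overlap / ((n_gt + n_det) / 2)


def sfda(frames):
    """Sequence Frame Detection Accuracy.

    frames: list of (mapped_pairs, n_gt, n_det) tuples, one per frame.
    The sum of per-frame FDA values is divided by the number of frames
    that contain at least one ground-truth or detected object.
    """
    numerator = sum(fda(pairs, n_gt, n_det) for pairs, n_gt, n_det in frames)
    active = sum(1 for _, n_gt, n_det in frames if n_gt > 0 or n_det > 0)
    return numerator / active if active else 0.0


# Two 2x2 squares that share one column of two pixels.
a = frozenset({(0, 0), (0, 1), (1, 0), (1, 1)})
b = frozenset({(1, 0), (1, 1), (2, 0), (2, 1)})
video = [
    ([(a, a)], 1, 1),   # perfect detection
    ([(a, b)], 1, 2),   # partial overlap plus one false positive
    ([], 0, 0),         # empty frame, excluded from the denominator
]
score = sfda(video)
```

False positives and misses lower the score only through the counts `n_gt` and `n_det` in the FDA denominator, since unmapped objects contribute no overlap to the numerator.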

The ATA metric quantifies the performance of an object 
tracking algorithm as a function of the spatial overlap of a 
mapped set of sequences of detected object positions to a 
set of sequences of ground-truth object positions. The ATA
is calculated by first computing the Sequence Track Detection
Accuracy (STDA), which can be viewed as a tracking
performance measure unnormalized in terms of the number of
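objects. We can write the STDA as

$$\mathrm{STDA} = \sum_{i=1}^{N_{\mathrm{mapped}}} \frac{\sum_{t=1}^{N_{\mathrm{frames}}} \left| G_i^{(t)} \cap D_i^{(t)} \right| / \left| G_i^{(t)} \cup D_i^{(t)} \right|}{N_{(G_i \cup D_i \neq \emptyset)}} \qquad (47)$$

where $N_{\mathrm{mapped}}$ refers to an optimal mapping between ground-
truth and detected objects as specified in section B.1 using
the STDA as the relevant metric, and $N_{(G_i \cup D_i \neq \emptyset)}$ denotes the
number of frames in which a given tracked object, the ground-truth
object to which it is mapped, or both, are present.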
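
A minimal sketch of the STDA computation follows, assuming the standard definition from Kasturi et al. [37]: for each mapped pair, the per-frame overlap ratios are summed and divided by the number of frames in which either object is present. The track representation (a dictionary from frame index to a set of pixels, containing only the frames where the object is present) and the function names are assumptions made for illustration.

```python
def overlap_ratio(gt_region, det_region):
    """Intersection over union of two regions given as sets of (x, y) pixels."""
    union = gt_region | det_region
    return len(gt_region & det_region) / len(union) if union else 0.0


def stda(mapped_tracks):
    """Sequence Track Detection Accuracy over mapped track pairs.

    mapped_tracks: list of (gt_track, det_track) pairs, where each track
    is a dict {frame_index: set of pixels} for frames where it is present.
    """
    empty = frozenset()
    total = 0.0
    for gt_track, det_track in mapped_tracks:
        # Frames in which the ground-truth object, the detect, or both exist.
        frames = set(gt_track) | set(det_track)
        if not frames:
            continue
        overlap_sum = sum(
            overlap_ratio(gt_track.get(t, empty), det_track.get(t, empty))
            for t in frames
        )
        total += overlap_sum / len(frames)
    return total


px = frozenset({(0, 0), (0, 1)})
gt = {0: px, 1: px, 2: px}                  # present in frames 0 to 2
det = {1: px, 2: frozenset({(0, 0)}), 3: px}  # present in frames 1 to 3
value = stda([(gt, det)])
```

Frames in which only one member of a pair is present add nothing to the overlap sum but still count in the denominator, so tracks that start late or end early are penalized.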

Given the STDA for a video sequence, the ATA can be 
computed by the formula 



